Week 8 • Sub-Lesson 6

🧪 Hands-On: Activities & Assessment

Three practical activities testing multimodal AI across images, audio, and documents — plus your weekly assessment

What We'll Cover

This session applies everything from the week: multimodal AI capabilities, the specific failure modes in each modality, and what verification looks like when AI is working with non-text data. Each activity is designed to produce a specific kind of learning — not just "I used the tool" but "I understand where this tool can mislead me."

You will work through three structured activities (figure analysis, transcription verification, and document extraction), then complete the weekly assessment, which asks you to apply multimodal AI to your own research materials and document what you find.

📊 Activity 1: Figure Analysis Challenge

🔎 "Reading vs. Understanding"

Materials: You will use three publicly available scientific figures of increasing complexity. Instructions are provided for finding suitable figures from open-access papers in your field. Choose figures that represent the types of visualisations common in your research area — a simple bar chart, a multi-line plot with several series, and a complex figure (heatmap, network diagram, or multi-panel figure).

Task — for each figure:

  1. Use Claude or GPT (or equivalent frontier multimodal model) to describe what the figure shows. Record the AI's output verbatim.
  2. Use the AI to extract specific numerical values visible in the figure — axis values, data points, percentages, sample sizes. Record these alongside the values you read from the figure yourself.
  3. Verify the extracted values against the figure's underlying data where possible (the paper's methods section, supplementary data, or the figure's own axis labels).
  4. Ask the AI to explain the main finding the figure is communicating. Compare its interpretation with the paper's own discussion of that figure.
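
If you prefer to script this workflow rather than paste figures into a chat interface, the sketch below sends one figure to Claude through the Anthropic Messages API. A minimal sketch, not course-provided code: the model name, filename, and prompt are placeholders to adapt, and GPT offers an equivalent image-input API if that is your tool.

```python
# Minimal sketch: send a figure to a multimodal model via the Anthropic
# Messages API. Assumes `pip install anthropic` and an ANTHROPIC_API_KEY
# set in your environment. Model name, filename, and prompt are placeholders.
import base64
import anthropic

with open("figure1.png", "rb") as f:  # placeholder filename
    image_data = base64.b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # use whatever current model you tested
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png",
                        "data": image_data}},
            {"type": "text",
             "text": "Describe what this figure shows. Then list every "
                     "numerical value you can read from it: axis values, "
                     "data points, percentages, sample sizes."},
        ],
    }],
)
print(response.content[0].text)  # record this verbatim for steps 1 and 2
```

Scripting the call also makes your test reproducible: you can record the exact model and date, which the discussion prompt below asks for.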

What to look for:

  • Where does the AI describe qualitative patterns accurately but misread quantitative details?
  • Does it read the scale and axis labels correctly? Does it handle logarithmic scales?
  • Does it confuse different series in a multi-line chart or misidentify which colour corresponds to which group?
  • Does its "main finding" interpretation match what the authors actually say in the paper?

💬 Discussion Prompt

When CharXiv launched in 2024, leading models scored 47.1% on scientific chart reasoning vs. 80.5% for humans. Frontier model scores have improved substantially since — so the headline gap will not match what you see today. The more interesting question for your forum post is: which features of your test charts did the AI handle reliably, and which did it still get wrong? Bring specific examples and note the model and date you tested.

🎤 Activity 2: Transcription and Verification

🔍 "The Hallucination Hunt — Test It On Yourself"

This exercise asks you to record your own voice and transcribe it. The pedagogical reason is direct: the lesson's claims about Whisper's performance on African-accented English are abstract until you hear your own words come back wrong. You will also generate your own ground truth (the script you wrote and read aloud), so verification is genuine.

Step 1 — Write a script (~3 minutes when read aloud, around 400–500 words). Treat it as if you are explaining a piece of your research to a colleague. Deliberately include the following challenge features — these are precisely the things Whisper is most likely to mishandle:

  • Technical jargon from your field — at least 5–10 specialist terms (e.g. haematopoiesis, heteroskedasticity, phenomenology, k-means clustering)
  • Proper nouns: 3–5 researcher surnames you would cite; 2–3 South African place names (Khayelitsha, Stellenbosch, Mthatha, Polokwane); your own institution name spelled out
  • Acronyms read as acronyms: UCT, SARS-CoV-2, fMRI, ANOVA, NRF
  • At least three homophone pairs in context: their/there, principal/principle, affect/effect, complement/compliment
  • One number-heavy sentence: e.g. "The 2024 cohort included 1,247 participants across 16 sites in eight provinces."
  • One sentence in your second language if you have one (Afrikaans, isiZulu, isiXhosa, Sesotho) — even a few words are enough to surface what Whisper does with code-switching

Step 2 — Record yourself reading the script. Use your phone or laptop microphone. Read at normal pace; do not over-articulate. Save the recording as a single audio file (.wav, .mp3, or .m4a). Aim for around 3 minutes of audio. If you have a quiet room, use it; if you do not, that is also useful data — background noise is one of the hallucination triggers covered in Sub-Lesson 4.

Step 3 — Transcribe with Whisper. Two options, both free:

  • Option A — Hugging Face Space (no install): upload your audio to huggingface.co/spaces/openai/whisper and select large-v3 as the model. Output appears in the browser.
  • Option B — Google Colab notebook: follow the course-provided notebook, which runs Whisper on a free GPU. Slightly more setup, but you can save the output and re-run. Notebook URL in the course resources.
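
A third route, if you are comfortable installing packages, is to run Whisper locally with the open-source openai-whisper package. A minimal sketch, assuming `pip install -U openai-whisper` and ffmpeg on your PATH; large-v3 wants a capable GPU, and the smaller medium model is a workable (if slower and likely less accurate) CPU fallback:

```python
# Minimal local-Whisper sketch; not course-provided code. Adapt the
# filename and model size to your setup.
import whisper

model = whisper.load_model("large-v3")      # downloads weights on first run
result = model.transcribe("recording.m4a")  # .wav and .mp3 also work

print(result["text"])                       # the full transcript
for seg in result["segments"]:              # per-segment timestamps
    print(f'[{seg["start"]:7.2f}s - {seg["end"]:7.2f}s] {seg["text"]}')
```

Running locally also means your recording never leaves your machine, which matters if your script contains unpublished research content.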

Step 4 — Compare line by line against your own script (your ground truth). Read both aloud if needed. Mark every difference. Categorise each error:

  • Substitution — wrong word in place of the right one
  • Deletion — word or phrase missing from the transcript
  • Insertion — hallucinated content not present in the audio (this is the most consequential category)
  • Other — punctuation, capitalisation, formatting

Step 5 — Calculate your own WER. WER = (substitutions + deletions + insertions) / total words in ground truth, expressed as a percentage. Compare your number to the headline benchmarks in Sub-Lesson 4 (~2.0% on clean read speech). If your number is much higher, what features of your script and recording explain the difference?
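
If you want the arithmetic done for you, the sketch below computes the substitution, deletion, and insertion counts and the resulting WER by word-level edit distance. It is deliberately minimal: it lower-cases and splits on whitespace only, so strip punctuation from both texts first or punctuation differences will inflate your number.

```python
# Minimal WER sketch: word-level edit distance with a backtrace that splits
# errors into substitutions (S), deletions (D), and insertions (I).
def wer_breakdown(reference: str, hypothesis: str):
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = fewest edits turning the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    subs = dels = ins = 0                          # walk back to classify errors
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                    # correct word
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1         # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1                      # deletion
        else:
            ins += 1; j -= 1                       # insertion
    return subs, dels, ins, 100 * (subs + dels + ins) / len(ref)

s, d, i, rate = wer_breakdown(open("script.txt").read(), open("transcript.txt").read())
print(f"S={s}  D={d}  I={i}  WER={rate:.1f}%")
```

The jiwer package (pip install jiwer) performs the same calculation with more normalisation options, if you would rather use an off-the-shelf tool.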

💭 Reflection (for your forum post)

Three questions to address: (1) Which of your challenge features tripped Whisper? Were homophones, jargon, proper nouns, or your second-language sentence the worst? (2) Did Whisper insert anything that was not said? Be specific — quote the inserted text. (3) If you had used this transcript for qualitative coding without checking it against your script, would your analysis have been corrupted in ways a reader could not detect from the transcript alone?

🇿🇦 Why This Matters for South African Research

South African English exists in many varieties, and most postgraduate students at UCT will speak some combination of accented English with code-switching from one or more African languages. The course materials cite published numbers; this exercise lets you measure them on yourself. If your Whisper output is materially worse than the headline 2.0% figure — and for many South African accents it will be — that is the central pedagogical finding. The numbers in published benchmarks are not your numbers.

📄 Activity 3: Document Extraction Challenge

📋 "The Table Problem"

Source paper: The COPCOV trial of hydroxychloroquine/chloroquine for COVID-19 prevention — an open-access randomised controlled trial published in PLOS Medicine (2024). Download the PDF directly from the publisher: journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1004428

You will extract two tables from this single PDF. Both look superficially tractable — neither has merged column headers or nested column groups — but both contain the kinds of structural features that everyday research tables actually have, and that AI extraction tools quietly mishandle. The point of the exercise is to see what your tool does with the features that show up in real papers, not the rare textbook-complex cases.

  • Table 1 — Demographic details of COPCOV study participants. Three data columns (HCQ/CQ arm, Placebo arm, Total). Watch for: bold section headers spanning all columns (Existing comorbidities, Baseline symptoms) that are visual organisers rather than data rows; implicit category grouping under labels like “Sex, n (%)” and “Smoking, n (%)”; embedded italics in row labels (n, N, IQR); en-dash ranges in cell values (e.g. 23–39); and a page break in the middle of the table with a “(Continued)” marker.
  • Table 2 — Prespecified endpoints in the intention-to-treat population. Four data columns after the Outcome label column (HCQ/CQ, Placebo, Risk ratio, p-value). Watch for: multi-line cell content — each endpoint cell stacks the n/N count and the percentage with 95% CI on two visual lines but they are one logical cell; section divider rows (Secondary endpoints:) that span all data columns; bold sub-categorisation in row labels (Primary endpoint:, Tertiary endpoint:); footnote markers in cells (* after “All-cause respiratory illness”, ** after the p-value 0.0002); mixed CI delimiters (mostly “to”, but one cell uses an en-dash); and text wrapping in narrow columns that splits values across visual lines for layout reasons only.

Task:

  1. Extract Table 1 using a tool of your choice. Suggested options: Docling for local processing (free, open-source, pip install docling; a starter sketch follows this list), or direct upload of the PDF to Claude or GPT with a prompt asking for the table content as Markdown or CSV. Note which tool you used and any relevant settings.
  2. Repeat for Table 2 using the same tool, so the comparison isolates the effect of table content rather than tool choice.
  3. For each table, open the original PDF alongside the extracted output and verify cell by cell.
  4. Calculate cell-level accuracy for each: what percentage of cells were correctly extracted — correct content, correct position in the structure, and correct relationship to row and column headers?
  5. Identify which specific structural features in each table caused extraction failures, and whether the failure modes differ between the two tables.
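
For the Docling route, the starter sketch below may help. It is based on Docling's documented quickstart rather than course-provided code, so check the current Docling documentation for options that affect table handling; the filename is a placeholder:

```python
# Minimal Docling sketch, assuming `pip install docling` (pandas is needed
# for the DataFrame export). Converts the COPCOV PDF and dumps each table.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("pmed.1004428.pdf")  # placeholder local filename

# Whole document as Markdown: tables render as Markdown tables
print(result.document.export_to_markdown())

# Or write each detected table to CSV for the cell-by-cell verification step
for idx, table in enumerate(result.document.tables):
    table.export_to_dataframe().to_csv(f"table_{idx + 1}.csv", index=False)
```

Exporting each table to CSV makes step 3 easier: open the CSV next to the PDF and check content, structural position, and header relationships one cell at a time.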

What to document for your forum post:

  • Section headers as data rows: Did your tool treat Existing comorbidities, Baseline symptoms, and Secondary endpoints: as data rows, drop them silently, or mark them as headers? Each of these failure modes corrupts downstream analysis differently.
  • Page break (Table 1): Did the tool recognise that the table continues across the page break, or did it produce two disconnected tables? Was the “(Continued)” marker left in the output as if it were data?
  • Multi-line cells (Table 2): Were the n/N counts and the percentage-with-CI kept together as one logical cell, or split into separate rows? This is the most consequential extraction failure for this kind of table.
  • Footnote markers: Were the * and ** retained in the output? If retained, were they linked to their footnote text below the table, or left as orphan symbols?
  • Formatting fidelity: Were italics (n, N, IQR) preserved? Were en-dash ranges (23–39) preserved as ranges, or converted to other characters?
  • Cell-level accuracy gap: What is the gap between Table 1 and Table 2 with the same tool, and which structural features account for the difference?
  • The trust question: Would you use either extraction for quantitative meta-analysis without manual verification? Which kind of error would be hardest for a reader to detect after the fact?

📝 Weekly Assessment

Multimodal Analysis Report — 1000 Words

Use multimodal AI to process research materials relevant to your own research domain. You must use at least two different modalities (for example: an image and an audio recording; a PDF table and a scientific figure; a video and a document).

Your report should address:

1. What you did: which tools, which materials, what tasks you asked the AI to perform.

2. What it got right: where the AI's output was accurate and useful — be specific about what it handled well.

3. What it got wrong: document at least three specific errors or limitations you observed, with examples from your actual output.

4. Verification approach: how did you verify the AI's output? What would you have missed if you had not checked?

5. Research workflow implications: how would you integrate these tools into your research practice? What safeguards would you build in?

Submission: via Amathuba by [date].

Grading criteria: Accuracy of tool use (25%), quality of error documentation (30%), depth of verification practice (25%), research workflow integration (20%).

📚 Full Week Summary

What You Have Learned This Week

AI can now see images, read documents, transcribe audio, and process hours of video — but "reading" and "understanding" remain fundamentally distinct. The gap between impressive demonstrations and reliable research use is large, and this week has been about measuring that gap precisely.

The CharXiv benchmark (introduced in 2024 with leading models at 47.1% vs. 80.5% for humans) established the pattern that runs through this week: real-world performance lags what simplified benchmarks suggest. Frontier model scores on CharXiv have improved substantially since, but the underlying gap reappears in each modality and is the framing to carry forward:

  • Transcription hallucination is pervasive — every transcript used in research requires human verification, not spot-checking.
  • African language ASR has a real performance gap relative to English, but specialist tools (Intron Sahara, Lelapa AI) and fine-tuning significantly close it.
  • Complex tables remain the hardest document AI problem: merged cells and multi-level headers cause extraction failures that are invisible without side-by-side verification.

Video AI compounds failure modes from each modality — verify temporal claims especially, and do not assume that a large context window means equal attention across the full recording. The "lost in the middle" problem applies to video as much as to long documents.

The skill of the week is not memorising which tool to use. It is developing the verification habits that make multimodal AI trustworthy for research: checking every extracted number, reviewing every transcript against audio, verifying every timestamp, and documenting what you checked and how.

Next week — Week 9: We turn to sycophancy, reasoning failures, and the specific ways AI misleads even when it is not hallucinating. The question shifts from "did the AI get the data right?" to "did the AI reason about the data correctly?"